A hybrid statistical/RNN approach to prosody synthesis for Taiwanese TTS
Abstract
Min-Nan is a spoken dialect widely used in south-eastern China and Taiwan. Like Mandarin, Min-Nan is a monosyllabic, tonal language: each character is pronounced as a syllable carrying a lexical tone. There are only 877 base-syllables and 8 tones, including a degenerated one that is not used in modern Taiwanese. These 877 base-syllables have the same initial-final structure as Mandarin base-syllables, with 18 initials and 82 finals. Although tonal syllables are the basic pronunciation units, Min-Nan speech is colloquial and does not have a standard written form. Two written forms are popular in Taiwan. One is the Romanization form, which uses the Roman alphabet to spell each base-syllable and a number to specify its tone. The other is a hybrid form, which uses Chinese characters to represent ordinary words and the Romanization form to represent some extraordinary syllables. For the hybrid written form, the system for representing words in Chinese characters is still not standardized in Taiwan, which makes text analysis very difficult for Min-Nan. There have been very few studies of Min-Nan TTS, only a few preliminary ones in Taiwan in recent years [1-3]. This is due in part to the lack of a standardized written-form representation, as mentioned above, and in part to the difficulty of collecting a large speech database. In this paper, a hybrid statistical/RNN approach to prosodic information synthesis for Min-Nan TTS is proposed. It first models some primary factors affecting the generation of prosodic information with a statistical method and then uses an RNN to handle all other affecting factors.
The new approach has two advantages over the conventional RNN-based approach: (1) it provides more consistent data for training the RNN by suppressing interference factors such as speaking-rate variation; and (2) it relieves the RNN of part of the prosodic modeling load. The paper is organized as follows. Section 2 presents the proposed hybrid prosody synthesis method and discusses two schemes of it. The performance of these two schemes is evaluated by experiments discussed in Section 3. Some conclusions are given in the last section.
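The division of labor described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the statistical model, so the code below assumes, purely for illustration, that the "primary factor" is an utterance-level speaking rate captured by the mean and standard deviation of syllable durations. The function names (`normalize_utterance`, `denormalize`, `predict_residuals`) are hypothetical, and `predict_residuals` is only a stand-in for the RNN, not the network used in the paper.

```python
from statistics import mean, pstdev


def normalize_utterance(durations):
    """Remove utterance-level speaking-rate effects from syllable durations.

    Assumption (not from the paper): speaking rate is summarized by the
    per-utterance mean and population standard deviation of the durations.
    Returns the normalized residuals plus the statistics needed to undo them.
    """
    mu = mean(durations)
    sigma = pstdev(durations) or 1.0  # avoid division by zero on flat input
    residuals = [(d - mu) / sigma for d in durations]
    return residuals, mu, sigma


def denormalize(residuals, mu, sigma):
    """Map normalized residuals back to durations at a chosen speaking rate."""
    return [r * sigma + mu for r in residuals]


def predict_residuals(linguistic_features):
    """Hypothetical stand-in for the RNN.

    A trained network would map linguistic features to normalized residuals;
    this placeholder simply predicts a neutral residual for every syllable.
    """
    return [0.0 for _ in linguistic_features]


# Two readings of the same pattern at different speaking rates (durations in ms).
slow = [180.0, 220.0, 200.0]
fast = [90.0, 110.0, 100.0]

slow_res, slow_mu, slow_sigma = normalize_utterance(slow)
fast_res, fast_mu, fast_sigma = normalize_utterance(fast)

# At synthesis time the residuals are rescaled to the target speaking rate.
restored = denormalize(slow_res, slow_mu, slow_sigma)
```

Under this assumption the slow and fast readings yield the same residuals, so a network trained on them sees consistent targets, which is the kind of consistency the first advantage refers to.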
Similar resources
A Corpus-Based Prosodic Modeling Method for Mandarin and Min-Nan Text-to-Speech Conversions
This talk gives an introduction to a recurrent neural network (RNN) based prosody synthesis method for both Mandarin and Min-Nan text-to-speech (TTS) conversions. The method uses a four-layer RNN to model the dependency of output prosodic information and input linguistic information. Main advantages of the method are the capability of learning many human’s prosody pronunciation rules automaticall...
An NN-based Approach to Prosody Generation for English Word Spelling in English-Chinese Bilingual TTS
In this paper, an RNN-MLP-based scheme to generate proper prosodic information for spelling English words embedded in Chinese text background is proposed. It is extended from the RNN prosody synthesis scheme of an existing Mandarin TTS by adding four MLPs to follow the RNN. It first treats each English word as a Chinese word and uses the RNN to generate eight prosodic parameters for each alphab...
A Hybrid TTS Approach for Prosody and Acoustic Modules
Unit selection (US) TTS systems generate quite natural speech, but it is highly variable in quality. Statistical parametric (SP) systems offer far more consistent quality but reduced naturalness due to their vocoding nature. We present a hybrid approach (HA) that tries to improve the overall naturalness by combining both synthesis methods. Contrary to other works, the fusion of methods is performed both in prosod...
A Taiwanese (min-nan) text-to-speech (TTS) system based on automatically generated synthetic units
A Taiwanese (Min-nan) Text-to-Speech (TTS) system has been constructed in this paper based on automatically generated synthetic units by considering several specific phonetic and linguistic characteristics of Taiwanese. Some basic facts about Taiwanese that are useful in a TTS system are summarized, including the issues of tone sandhi, the written format, and others. Three functional modules, namely a ...
Some Studies on Min-Nan Speech Processing
In this paper, three studies of Min-Nan speech processing are presented. The first study concerns the implementation of a high-performance Min-Nan TTS system. On the basis of the waveform templates of 877 base-syllables used as basic synthesis units and through the application of the RNN-based prosody generation method and the PSOLA algorithm for prosody modification, this Min-Nan TTS system ca...
Publication date: 2000